402 research outputs found

    Error and Error Mitigation in Low-Coverage Genome Assemblies

    The recent release of twenty-two new genome sequences has dramatically increased the data available for mammalian comparative genomics, but twenty of these new sequences are currently limited to ~2× coverage. Here we examine the extent of sequencing error in these 2× assemblies, and its potential impact in downstream analyses. By comparing 2× assemblies with high-quality sequences from the ENCODE regions, we estimate the rate of sequencing error to be 1–4 errors per kilobase. While this error rate is fairly modest, sequencing error can still have surprising effects. For example, an apparent lineage-specific insertion in a coding region is more likely to reflect sequencing error than a true biological event, and the length distribution of coding indels is strongly distorted by error. We find that most errors are contributed by a small fraction of bases with low quality scores, in particular, by the ends of reads in regions of single-read coverage in the assembly. We explore several approaches for automatic sequencing error mitigation (SEM), making use of the localized nature of sequencing error, the fact that it is well predicted by quality scores, and information about errors that comes from comparisons across species. Our automatic methods for error mitigation cannot replace the need for additional sequencing, but they do allow substantial fractions of errors to be masked or eliminated at the cost of modest amounts of over-correction, and they can reduce the impact of error in downstream phylogenomic analyses. Our error-mitigated alignments are available for download.
    Funding: National Science Foundation (U.S.) (Faculty Early Career Development grant DBI-0644111); National Science Foundation (U.S.) (Faculty Early Career Development grant DBI-0644282); National Science Foundation (U.S.) (Faculty Early Career Development grant U54 HG004555-01); David & Lucile Packard Foundation; David & Lucile Packard Foundation (Fellowship for Science and Engineering
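
The abstract notes that sequencing error is well predicted by quality scores and that errors can be masked. A minimal sketch of that general idea follows; it is not the authors' SEM method. The function names and the threshold of 20 are hypothetical, and the only assumed fact is the standard Phred definition, where a score Q corresponds to an error probability of 10^(-Q/10).

```python
from typing import List


def mask_low_quality(bases: str, quals: List[int], min_q: int = 20) -> str:
    """Replace every base whose Phred quality is below min_q with 'N'.

    min_q = 20 is an illustrative threshold, not a value from the paper.
    """
    if len(bases) != len(quals):
        raise ValueError("bases and quals must have the same length")
    return "".join(b if q >= min_q else "N" for b, q in zip(bases, quals))


def expected_errors(quals: List[int]) -> float:
    """Expected number of miscalled bases implied by Phred scores."""
    return sum(10 ** (-q / 10) for q in quals)


if __name__ == "__main__":
    read = "ACGTACGT"
    quals = [40, 40, 35, 30, 12, 8, 25, 5]  # quality drops toward the read end
    print(mask_low_quality(read, quals))
    print(expected_errors(quals))
```

Masking trades a small amount of lost true sequence for fewer spurious differences, which mirrors the over-correction cost the abstract describes.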

    Multi-Objective and Multidisciplinary Design Optimisation of Unmanned Aerial Vehicle Systems using Hierarchical Asynchronous Parallel Multi-Objective Evolutionary Algorithms

    The overall objective of this research was to realise the practical application of Hierarchical Asynchronous Parallel Evolutionary Algorithms for Multi-objective and Multidisciplinary Design Optimisation (MDO) of UAV Systems using high fidelity analysis tools. The research looked at the assumed aerodynamics and structures of two production UAV wings and attempted to optimise these wings in isolation from the rest of the vehicle. The project was sponsored by the Asian Office of the Air Force Office of Scientific Research under contract number AOARD-044078. The two wings that were optimised were based upon assumptions made on the Northrop Grumman Global Hawk (GH), a High Altitude Long Endurance (HALE) vehicle, and the General Atomics Altair (Altair), a Medium Altitude Long Endurance (MALE) vehicle. The optimisations for both vehicles were performed at cruise altitude with MTOW minus 5% fuel and a 2.5g load case. The GH was assumed to use the NASA LRN 1015 aerofoil at the root, crank and tip locations with five spars and ten ribs. The Altair was assumed to use the NACA4415 aerofoil at all three locations with two internal spars and ten ribs. Both models used a parabolic variation of spar, rib and wing skin thickness as a function of span, and in the case of the wing skin thickness, also chord. The work was carried out by integrating the current University of Sydney designed Evolutionary Optimiser (HAPMOEA) with Computational Fluid Dynamics (CFD) and Finite Element Analysis (FEA) tools. The variable values computed by HAPMOEA were subjected to structural and aerodynamic analysis. The aerodynamic analysis computed the pressure loads using PANAIR, a Boeing-developed Morino-class panel method code. These aerodynamic results were coupled to an FEA code, MSC.Nastran®, and the strain and displacement of the wings were computed. The fitness of each wing was computed from the outputs of each program.
In total, 48 design variables were defined to describe both the structural and aerodynamic properties of the wings subject to several constraints. These variables allowed for the alteration of the three aerofoil sections describing the root, crank and tip sections. They also described the internal structure of the wings, allowing for variable flexibility within the wing box structure. These design variables were manipulated by the optimiser such that two fitness functions were minimised. The fitness functions were the overall mass of the simulated wing box structure and the inverse of the lift to drag ratio. Furthermore, six penalty functions were added to further penalise genetically inferior wings and force the optimiser to not pass on their genetic material. The results indicate that, given the initial assumptions made on all the aerodynamic and structural properties of the HALE and MALE wings, a reduction in mass and drag is possible through the use of the HAPMOEA code. The code was terminated after 300 evaluations of each hierarchical level due to plateau effects. These evolutionary optimisation results could be further refined through a gradient based optimiser if required. Even though a reduced number of evaluations was performed, weight and drag reductions of between 10 and 20 percent were readily achieved and indicate that the wings of both vehicles can be optimised
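
The two minimised fitness functions and the penalty terms described above can be illustrated with a small sketch. This is not the HAPMOEA code: the function names, the penalty weight, and the choice to add the same penalty to both objectives are assumptions made purely for illustration. Only the objectives themselves (wing-box mass and the inverse of the lift to drag ratio) come from the abstract.

```python
from typing import List, Sequence, Tuple

Objectives = Tuple[float, float]  # (mass, drag/lift), both to be minimised


def penalised(mass: float, lift: float, drag: float,
              violations: Sequence[float], weight: float = 1e3) -> Objectives:
    """Return both objectives with a penalty for each violated constraint.

    violations holds constraint excesses (values <= 0 mean satisfied).
    The weight and the additive form are hypothetical choices.
    """
    penalty = weight * sum(max(0.0, v) for v in violations)
    return (mass + penalty, drag / lift + penalty)


def dominates(a: Objectives, b: Objectives) -> bool:
    """True if a is no worse than b in every objective and better in one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def pareto_front(points: List[Objectives]) -> List[Objectives]:
    """Keep only the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


if __name__ == "__main__":
    candidates = [
        penalised(100.0, 20.0, 1.0, []),     # feasible
        penalised(90.0, 18.0, 1.0, []),      # lighter, slightly worse L/D
        penalised(110.0, 18.0, 1.0, []),     # dominated by both of the above
        penalised(80.0, 25.0, 1.0, [0.5]),   # infeasible, heavily penalised
    ]
    for point in pareto_front(candidates):
        print(point)
```

With two competing objectives the optimiser returns a set of trade-off designs rather than a single best wing, which is why a dominance test replaces a simple comparison.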

    Accurate reconstruction of insertion-deletion histories by statistical phylogenetics

    The Multiple Sequence Alignment (MSA) is a computational abstraction that represents a partial summary either of indel history, or of structural similarity. Taking the former view (indel history), it is possible to use formal automata theory to generalize the phylogenetic likelihood framework for finite substitution models (Dayhoff's probability matrices and Felsenstein's pruning algorithm) to arbitrary-length sequences. In this paper, we report results of a simulation-based benchmark of several methods for reconstruction of indel history. The methods tested include a relatively new algorithm for statistical marginalization of MSAs that sums over a stochastically-sampled ensemble of the most probable evolutionary histories. For mammalian evolutionary parameters on several different trees, the single most likely history sampled by our algorithm appears less biased than histories reconstructed by other MSA methods. The algorithm can also be used for alignment-free inference, where the MSA is explicitly summed out of the analysis. As an illustration of our method, we discuss reconstruction of the evolutionary histories of human protein-coding genes.
    Comment: 28 pages, 15 figures. arXiv admin note: text overlap with arXiv:1103.434
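
For readers unfamiliar with Felsenstein's pruning algorithm, mentioned above as the finite-substitution starting point, here is a minimal single-site sketch. It covers only the classical fixed-alignment case, not the paper's automata-based generalization to indels. The Jukes-Cantor model and the nested-list tree encoding are assumptions chosen to keep the example short.

```python
import math
from typing import List, Tuple, Union

BASES = "ACGT"
# A tree node is either a leaf (one base character) or a list of
# (child, branch_length) pairs. This encoding is a hypothetical convenience.
Node = Union[str, List[Tuple["Node", float]]]


def jc69(t: float) -> List[List[float]]:
    """Jukes-Cantor transition matrix for branch length t (subs per site)."""
    e = math.exp(-4.0 * t / 3.0)
    same, diff = 0.25 + 0.75 * e, 0.25 - 0.25 * e
    return [[same if i == j else diff for j in range(4)] for i in range(4)]


def partial(node: Node) -> List[float]:
    """Return L[x] = P(leaves below node | node is in state x)."""
    if isinstance(node, str):
        return [1.0 if b == node else 0.0 for b in BASES]
    like = [1.0] * 4
    for child, t in node:
        p, below = jc69(t), partial(child)
        for x in range(4):
            # Sum over the child's state, then multiply across children.
            like[x] *= sum(p[x][y] * below[y] for y in range(4))
    return like


def site_likelihood(root: Node) -> float:
    """Likelihood of one alignment column, with uniform root frequencies."""
    return sum(0.25 * value for value in partial(root))


if __name__ == "__main__":
    tree = [([("A", 0.1), ("A", 0.2)], 0.05), ("G", 0.3)]
    print(site_likelihood(tree))
```

Each internal node is visited once, so the cost grows linearly with the number of branches instead of exponentially with the number of unobserved ancestral states.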

    The Genome Sequence DataBase: towards an integrated functional genomics resource

    During 1998 the primary focus of the Genome Sequence DataBase (GSDB; http://www.ncgr.org/gsdb) located at the National Center for Genome Resources (NCGR) has been to improve data quality, improve data collections, and provide new methods and tools to access and analyze data. Data quality has been improved by extensive curation of certain data fields necessary for maintaining data collections and for using certain tools. Data quality has also been increased by improvements to the suite of programs that import data from the International Nucleotide Sequence Database Collaboration (IC). The Sequence Tag Alignment and Consensus Knowledgebase (STACK), a database of human expressed gene sequences developed by the South African National Bioinformatics Institute (SANBI), became available within the last year, allowing public access to this valuable resource of expressed sequences. Data access was improved by the addition of the Sequence Viewer, a platform-independent graphical viewer for GSDB sequence data. This tool has also been integrated with other searching and data retrieval tools. A BLAST homology search service was also made available, allowing researchers to search all of the data, including the unique data, that are available from GSDB. These improvements are designed to make GSDB more accessible to users, extend the rich searching capability already present in GSDB, and to facilitate the transition to an integrated system containing many different types of biological data

    Nascent RNA sequencing reveals a dynamic global transcriptional response at genes and enhancers to the natural medicinal compound celastrol

    Most studies of responses to transcriptional stimuli measure changes in cellular mRNA concentrations. By sequencing nascent RNA instead, it is possible to detect changes in transcription in minutes rather than hours and thereby distinguish primary from secondary responses to regulatory signals. Here, we describe the use of PRO-seq to characterize the immediate transcriptional response in human cells to celastrol, a compound derived from traditional Chinese medicine that has potent anti-inflammatory, tumor-inhibitory, and obesity-controlling effects. Celastrol is known to elicit a cellular stress response resembling the response to heat shock, but the transcriptional basis of this response remains unclear. Our analysis of PRO-seq data for K562 cells reveals dramatic transcriptional effects soon after celastrol treatment at a broad collection of both coding and noncoding transcription units. This transcriptional response occurred in two major waves, one within 10 min, and a second 40-60 min after treatment. Transcriptional activity was generally repressed by celastrol, but one distinct group of genes, enriched for roles in the heat shock response, displayed strong activation. Using a regression approach, we identified key transcription factors that appear to drive these transcriptional responses, including members of the E2F and RFX families. We also found sequence-based evidence that particular transcription factors drive the activation of enhancers. We observed increased polymerase pausing at both genes and enhancers, suggesting that pause release may be widely inhibited during the celastrol response. Our study demonstrates that a careful analysis of PRO-seq time-course data can disentangle key aspects of a complex transcriptional response, and it provides new insights into the activity of a powerful pharmacological agent

    The UCSC Genome Browser Database: update 2006

    The University of California Santa Cruz Genome Browser Database (GBD) contains sequence and annotation data for the genomes of about a dozen vertebrate species and several major model organisms. Genome annotations typically include assembly data, sequence composition, genes and gene predictions, mRNA and expressed sequence tag evidence, comparative genomics, regulation, expression and variation data. The database is optimized to support fast interactive performance with web tools that provide powerful visualization and querying capabilities for mining the data. The Genome Browser displays a wide variety of annotations at all scales from single nucleotide level up to a full chromosome. The Table Browser provides direct access to the database tables and sequence data, enabling complex queries on genome-wide datasets. The Proteome Browser graphically displays protein properties. The Gene Sorter allows filtering and comparison of genes by several metrics including expression data and several gene properties. BLAT and In Silico PCR search for sequences in entire genomes in seconds. These tools are highly integrated and provide many hyperlinks to other databases and websites. The GBD, browsing tools, downloadable data files and links to documentation and other information can be found at

    ContDist: a tool for the analysis of quantitative gene and promoter properties

    Background: How promoter regions regulate gene expression is complicated and far from fully understood. It is known that histones' regulation of DNA compactness, DNA methylation, transcription factor binding sites and CpG islands play a role in the transcriptional regulation of a gene. Many high-throughput techniques now exist that permit the detection of epigenetic marks and regulatory elements in the promoter regions of thousands of genes. However, the subsequent analysis of such experiments (e.g. the resulting gene lists) has so far been hampered by the fact that no tool currently exists for a detailed analysis of the promoter regions.
    Results: We present ContDist, a tool to statistically analyze quantitative gene and promoter properties. The software includes approximately 200 quantitative features of gene and promoter regions for 7 commonly studied species. In contrast to "traditional" ontological analysis, which only works on qualitative data, all the features in the underlying annotation database are quantitative gene and promoter properties. Utilizing the strong focus of this tool on the promoter region, we show its usefulness in two case studies: the first on differentially methylated promoters and the second on the fundamental differences between housekeeping and tissue-specific genes. The two case studies both confirm recent findings and reveal previously unreported biological relations.
    Conclusion: ContDist is a new tool with two important properties: 1) it has a strong focus on the promoter region, which is disregarded by virtually all ontology tools, and 2) it uses quantitative (continuously distributed) features of genes and their promoter regions which are not available in any other tool. ContDist is available from http://web.bioinformatics.cicbiogune.es/CD/ContDistribution.php